
@hksdpc255

@hksdpc255 hksdpc255 commented Nov 2, 2025

Generalized and streaming-capable XML-style tool-call parsing with grammar enforcement and automatic template fixing.

Based on PR #15904, this patch introduces a generalized implementation for almost all XML-style tool-call formats.

Supported models

  • GLM 4.5/4.6
  • MiniMax M2
  • SeedOSS
  • Kimi-K2 (Thinking and non-thinking)
  • Qwen3-Coder (Thinking and non-thinking)
  • Apriel-1.5
  • Xiaomi-MiMo

Grammar-constrained tool-call outputs

Tool-call messages generated by the model are now strictly validated against a defined grammar.
A new automatic grammar generator simplifies the process of creating grammars for new models.
This ensures that all tool-call outputs are well-formed, structurally consistent, and reliably parsed.

Streaming support for tool-call parsing

The parser now supports streaming parsing, enabling incremental processing of tool-call messages as they are generated.
This enhancement improves responsiveness and allows real-time interaction during model inference.

Automatic chat-template fixing

A lightweight Jinja2-based patcher has been added to automatically fix official chat templates before use.
With this change, official templates now work out of the box, eliminating the need for custom modifications.

In-context reasoning

The parser now supports multiple reasoning blocks within a single generation, even when interleaved with tool calls.
All reasoning content is preserved. No information is lost during parsing or streaming.
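As a rough illustration of keeping several reasoning blocks from one generation, here is a minimal Python sketch. It assumes `<think>...</think>` delimiters and uses a hypothetical `<tool_call>` marker; it is not the parser added in this PR.

```python
import re

# Non-greedy match so each <think>...</think> pair is captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text):
    """Return (list of reasoning blocks, text with those blocks removed)."""
    reasoning = [m.group(1).strip() for m in THINK_RE.finditer(text)]
    content = THINK_RE.sub("", text)
    return reasoning, content

# Hypothetical generation: two reasoning blocks interleaved with a tool call.
sample = (
    "<think>Need the file list first.</think>"
    "<tool_call>list_files</tool_call>"
    "<think>Now read the match.</think>"
    "Done."
)
reasoning, content = split_reasoning(sample)
```

Both reasoning blocks end up in `reasoning`, in order, while the tool-call markup and the final text stay in `content`.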

Enhanced unit tests

Added a unit test for the streaming-mode parser. It simulates the generation phase by feeding content character by character, compares the parsed results, and verifies that streaming and non-streaming modes reach the same final state.
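The idea behind such a test can be sketched in Python with a toy incremental parser. The class and helper names below are hypothetical and handle only `<think>` tags; the PR's actual test is written against the C++ parser.

```python
class StreamingThinkParser:
    """Toy incremental parser: routes text to reasoning or content."""
    OPEN, CLOSE = "<think>", "</think>"

    def __init__(self):
        self.reasoning = ""
        self.content = ""
        self._buf = ""           # tail that may still hold a partial tag
        self._in_think = False

    def _emit(self, text):
        if self._in_think:
            self.reasoning += text
        else:
            self.content += text

    def feed(self, chunk):
        self._buf += chunk
        while self._buf:
            tag = self.CLOSE if self._in_think else self.OPEN
            idx = self._buf.find(tag)
            if idx >= 0:
                self._emit(self._buf[:idx])
                self._buf = self._buf[idx + len(tag):]
                self._in_think = not self._in_think
                continue
            # Hold back the longest suffix that could still grow into the tag.
            keep = 0
            for n in range(min(len(tag) - 1, len(self._buf)), 0, -1):
                if tag.startswith(self._buf[-n:]):
                    keep = n
                    break
            cut = len(self._buf) - keep
            self._emit(self._buf[:cut])
            self._buf = self._buf[cut:]
            break

    def finish(self):
        # Flush whatever is left; an incomplete tag is treated as plain text.
        self._emit(self._buf)
        self._buf = ""

def parse(chunks):
    parser = StreamingThinkParser()
    for chunk in chunks:
        parser.feed(chunk)
    parser.finish()
    return parser.reasoning, parser.content

text = "<think>plan a<b</think>answer <tag> <thi"
whole = parse([text])          # non-streaming: a single chunk
streamed = parse(list(text))   # streaming: one character at a time
assert whole == streamed
```

The sample deliberately includes a stray `<` inside the reasoning and a dangling partial tag at the end, since partial tags are where chunk boundaries are most likely to make the two modes diverge.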

Additional Notes

  • All unit tests have passed.
  • Community testing is welcome! Please try it out with your model integrations.
  • If your OpenAI-compatible client does not support sending reasoning_content back to the server, use the option --reasoning-format none.
  • When reporting issues, it’s recommended to add -lv 1 to the command line to enable more detailed logging.

Please use the chat template included in this PR, or any other chat template that you are certain will work correctly.

@MikeLP

MikeLP commented Nov 2, 2025

I'm looking forward to getting this PR merged!

@hksdpc255 Does it require a custom Jinja template from the previous PR, or does it work well as is?

@hksdpc255
Author

hksdpc255 commented Nov 2, 2025

For now, I’d recommend using a custom template if you’re running more complex workloads.
As for the embedded/official template, it won’t fail at the start, but it may be missing some features that your agent requires.

Edit: The official template is now working properly. There’s no longer any need for a custom template.

Edit2: Official template support for Minimax-M2 has been removed. See comment and ochafik/minja#7 (comment) for details.

@ochafik
Collaborator

ochafik commented Nov 2, 2025

FYI I've updated (my fork of) Minja w/ support for GLM 4.6's template.
Might affect how you deal w/ the polyfills, as it should now detect GLM's tool call capability properly.

@hksdpc255
Author

@ochafik Excellent work! Once llama.cpp syncs your changes, some parts of this PR can be safely removed.

However, there are still a few small patches needed — for example, replacing dict.items() with dict | items.

@hksdpc255
Author

Currently, the official Minimax-M2 chat template fails to run tool calls because dict.items() and list[-1] are not supported by llama.cpp’s Jinja2 rendering engine.

@ochafik
Collaborator

ochafik commented Nov 3, 2025

Currently, the official Minimax-M2 chat template fails to run tool calls because dict.items() and list[-1] are not supported by llama.cpp’s Jinja2 rendering engine.

@hksdpc255 Both should be supported. The confusing error you probably got was because minja implements items() on dict but not on str. It should detect whether the template expects arguments to be an object instead of a more common json string of said object (see requires_object_arguments), and adjust the inputs accordingly: now hopefully works for GLM 4.6.
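The object-versus-string distinction can be illustrated with a short Python sketch. The helper name is hypothetical and this is not minja's implementation; it only shows why a template that iterates over argument items needs the arguments as an object rather than as a JSON string.

```python
import json

def normalize_arguments(tool_call):
    """Return a copy of tool_call whose function.arguments is a dict.

    OpenAI-style clients usually send arguments as a JSON string; a template
    that loops over the argument items needs the decoded object instead.
    """
    fn = dict(tool_call["function"])
    args = fn.get("arguments") or {}
    if isinstance(args, str):
        args = json.loads(args)  # raises json.JSONDecodeError if malformed
    fn["arguments"] = args
    return {**tool_call, "function": fn}

call = {
    "type": "function",
    "function": {"name": "file_glob_search",
                 "arguments": "{\"pattern\": \"**/main.cpp\"}"},
}
normalized = normalize_arguments(call)
pairs = list(normalized["function"]["arguments"].items())
```

After normalization, `.items()` is available because the value is a dict; calling it on the original string would fail in Python as well.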

As for list[-1], it's supported, but MiniMax M2's template has a bug, see this comment.

And please feel free to file bugs on https://github.com/ochafik/minja; it should be cleaner to add syntax support there than to patch things up in llama.cpp.

@hksdpc255
Author

@ochafik Thank you for pointing that out. I’m currently applying your suggested fix in llama.cpp and will test whether it works as expected. Thanks again for the help!

@hksdpc255
Author

Good news! The Minimax M2 tool call is now working.

I’ll push the fix later.

@hksdpc255
Author

hksdpc255 commented Nov 3, 2025

Screenshot from the Zed editor: [image]

Model: unsloth's UD-Q3_K_XL

@hksdpc255 hksdpc255 mentioned this pull request Nov 3, 2025
@emuchogu

emuchogu commented Nov 3, 2025

Hi @hksdpc255 ,
I cloned your repo https://github.com/hksdpc255/llama.cpp/tree/xml_toolcall and unfortunately it's still not producing the initial think tag at least in the cli. See below.

Model: unsloth--MiniMax-M2-GGUF Q8_0

./llama-cli \
  -m /models/hub/models--unsloth--MiniMax-M2-GGUF/snapshots/*/Q8_0/MiniMax-M2-Q8_0-00001-of-00005.gguf \
  -ngl 99 \
  -sm layer \
  -ts 1,1,1,1,1,1,1,1 \
  -c 78000 \
  -t 16 \
  --jinja \
  -i

Output:

> what is the capital of france?
Okay, the user asked a straightforward question: "What is the capital of France?" This is basic geography knowledge, so the answer should be simple. I don't need to overcomplicate things. 

Hmm, maybe the user is just testing if I know basic facts, or perhaps they're new to this kind of question. Either way, the response should be clear and concise. No need for extra details unless they ask follow-ups. 

I recall that Paris is the capital of France. It's one of the most well-known capitals globally, so this should be an easy one. The user might be a student working on homework, or someone prepping for trivia. Or maybe they're just curious—either way, I should confirm it confidently. 

No signs of confusion or deeper needs here. The question is very direct. I'll just state the answer plainly. If they want more info later, like landmarks or history, they'll ask. For now, keep it simple: Paris is the capital. 

Wait, should I add that it's also a major cultural hub? Nah, overcomplicating it. Just the fact. Done.
</think>

The capital of France is **Paris**. 

Paris is not only the political center but also a major cultural, economic, and gastronomic hub, famous for landmarks like the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Champs-Élysées.

@hksdpc255
Author

@emuchogu Sorry, I haven’t tested it with llama-cli — only with llama-server.

If you want <think> and </think> to appear in the content, append --reasoning-format none when running llama-server.

I’m not sure whether llama-cli uses the same parsing logic.

ServeurpersoCom added a commit to ServeurpersoCom/llama.cpp that referenced this pull request Nov 3, 2025
@ServeurpersoCom
Collaborator

ServeurpersoCom commented Nov 3, 2025

I’ve reverted my previous PR (reasoning-format-minimax-m2) and merged PR #16932 into my testing-branch16 for isolated testing.
I’m running llama-swap with the new XML tool-call parser to check MiniMax-M2 compatibility without any synthetic injection, using --reasoning-format none to observe the parser’s raw behavior.

sendLoadingState: true

macros:
  llama-server: >
    ../llama.cpp.pascal/build/bin/llama-server
    --port 8081
    -ngl 999
    -ctk q8_0
    -ctv q8_0
    -fa on
    --mlock
    -np 1
    --jinja
  models: /var/www/ia/models
  proxy: http://127.0.0.1:8081

  MoE-MiniMax-M2-230B-A10B:
    cmd: |
      ${llama-server}
      -m ${models}/unsloth/MiniMax-M2-GGUF/MiniMax-M2-UD-Q2_K_XL-00001-of-00002.gguf
      --temp 1.0
      --top-p 0.95
      --top-k 40
      --n-cpu-moe 50
      --ctx-size 65536
      --reasoning-format none
    proxy: ${proxy}
    filters:
      strip_params: "temperature, top_p, top_k"

Without this PR:

Streaming, no initial <think> tag in the output:
[screenshot]

Curl without streaming, no initial <think> tag in the output:

(root|~/llama.cpp.pascal) curl http://127.0.0.1:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoE-MiniMax-M2-230B-A10B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "stream": false
  }' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1192  100   973  100   219    259     58  0:00:03  0:00:03 --:--:--   317
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The user asks: \"What is the capital of France?\" The answer is Paris. This is a simple question. There's no disallowed content. So the answer is \"Paris.\" Possibly also mention that it's Paris. So answer: \"The capital of France is Paris.\" There's no reason to go beyond that. There's no conflict with policy. So final answer: \"Paris.\"\n</think>\n\nThe capital of France is **Paris**."
      }
    }
  ],
  "created": 1762152163,
  "model": "MoE-MiniMax-M2-230B-A10B",
  "system_fingerprint": "b6942-5698549e7",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 85,
    "prompt_tokens": 29,
    "total_tokens": 114
  },
  "id": "chatcmpl-gfe455eld4ThdT1D7Ji6CtuJm6md4V7W",
  "timings": {
    "cache_n": 15,
    "prompt_n": 14,
    "prompt_ms": 273.966,
    "prompt_per_token_ms": 19.569,
    "prompt_per_second": 51.1012315396801,
    "predicted_n": 85,
    "predicted_ms": 3458.452,
    "predicted_per_token_ms": 40.6876705882353,
    "predicted_per_second": 24.577469920068282
  }
}
(root|~/llama.cpp.pascal)

With this PR:

Streaming: reasoning goes into reasoning_content:
[screenshot]

Curl without streaming, no initial <think> tag in the output:

(root|~/llama.cpp.pascal) curl http://127.0.0.1:8081/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "MoE-MiniMax-M2-230B-A10B",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 40,
    "stream": false
  }' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1265  100  1046  100   219    251     52  0:00:04  0:00:04 --:--:--   304
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm looking at how to respond to the question: \"What is the capital of France?\" The user expects a straightforward answer, which is \"Paris.\" I’ll keep it simple and concise, but I might consider adding a brief note about the Eiffel Tower. However, since the user didn't ask for extra information, I’ll focus on just saying \"Paris\" to fulfill their request. I want to ensure I’m following their guidelines accurately.\n</think>\n\nParis."
      }
    }
  ],
  "created": 1762152603,
  "model": "MoE-MiniMax-M2-230B-A10B",
  "system_fingerprint": "b6943-0619a5b7d",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 92,
    "prompt_tokens": 29,
    "total_tokens": 121
  },
  "id": "chatcmpl-WqvR2S73aa7cZEyIN7lm42yuuatYZwqO",
  "timings": {
    "cache_n": 15,
    "prompt_n": 14,
    "prompt_ms": 278.533,
    "prompt_per_token_ms": 19.895214285714285,
    "prompt_per_second": 50.263344020277664,
    "predicted_n": 92,
    "predicted_ms": 3852.551,
    "predicted_per_token_ms": 41.87555434782609,
    "predicted_per_second": 23.88028088401685
  }
}
(root|~/llama.cpp.pascal)

@hksdpc255
Author

Oh! It seems you’re using non-streaming mode. I can now reproduce your issue with stream: false.

Let me dig into what’s happening…

@ServeurpersoCom
Collaborator

Oh! It seems you’re using non-streaming mode. I can now reproduce your issue with stream: false.

Let me dig into what’s happening…

Yes, exactly: it works correctly in streaming mode (tested through the SvelteUI, which is specifically designed to be debug-friendly without needing curl -N), but not in non-streaming mode.
So the initial tag still doesn’t appear when stream: false.
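The non-streaming responses above contain a closing `</think>` without a matching opening tag. A client that receives such raw content can still separate the two parts. The Python sketch below uses a hypothetical helper and is not part of llama.cpp.

```python
def split_on_close_tag(content, open_tag="<think>", close_tag="</think>"):
    """Split a completion into (reasoning, answer).

    Works whether or not the opening tag is present in the text.
    """
    head, sep, tail = content.partition(close_tag)
    if not sep:
        # No closing tag at all: treat everything as the answer.
        return "", content.strip()
    head = head.strip()
    if head.startswith(open_tag):
        head = head[len(open_tag):].strip()
    return head, tail.strip()

# Shortened from the curl output above (opening tag missing).
raw = "So final answer: \"Paris.\"\n</think>\n\nThe capital of France is **Paris**."
reasoning, answer = split_on_close_tag(raw)
```

Splitting on the closing tag rather than on a matched pair is what makes the helper tolerant of the missing opening tag.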

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Nov 3, 2025

Toolcall debug on SvelteUI with your #16932 + #16618 :)

Custom JSON :

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "simple_addition_tool",
        "description": "A dummy calculator tool used for testing multi-argument tool call streaming.",
        "parameters": {
          "type": "object",
          "properties": {
            "a": {
              "type": "number",
              "description": "The first number to add."
            },
            "b": {
              "type": "number",
              "description": "The second number to add."
            }
          },
          "required": ["a", "b"]
        }
      }
    }
  ]
}
[screenshots]
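For anyone reproducing this test outside the UI, the client-side counterpart of the tool definition above could look like the following Python sketch. The dispatch table and `run_tool_call` are hypothetical names; only the tool name and its `a`/`b` parameters come from the JSON above.

```python
import json

def simple_addition_tool(a, b):
    """Dummy calculator matching the schema above: two required numbers."""
    return a + b

# Map tool names (as declared in the "tools" array) to local callables.
TOOLS = {"simple_addition_tool": simple_addition_tool}

def run_tool_call(tool_call):
    """Execute one OpenAI-style tool call and return its result."""
    fn = tool_call["function"]
    args = fn["arguments"]
    if isinstance(args, str):      # arguments normally arrive as a JSON string
        args = json.loads(args)
    return TOOLS[fn["name"]](**args)

example_call = {
    "id": "call_0",                # placeholder id for illustration
    "type": "function",
    "function": {"name": "simple_addition_tool",
                 "arguments": "{\"a\": 2, \"b\": 3}"},
}
result = run_tool_call(example_call)
```

The result would then be sent back to the server in a `tool` role message so the model can continue.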

@hksdpc255
Author

hksdpc255 commented Nov 3, 2025

@ServeurpersoCom The problem is that I added some code that makes it fall back to llama.cpp’s original parser when there are no tools, so the new parser is never called.

llama.cpp/common/chat.cpp

Lines 2748 to 2753 in af5216e

if (!builder.syntax().parse_tool_calls) {
    // MiniMax-M2 uses <think>...</think> tags for reasoning content
    builder.try_parse_reasoning("<think>", "</think>");
    builder.add_content(builder.consume_rest());
    return;
}

Simply deleting the code above should fix the issue. I’ll run more tests before pushing a new commit.

[image]

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Nov 3, 2025

@ServeurpersoCom The problem is that I added some code that makes it fall back to llama.cpp’s original parser when there are no tools, so the new parser is never called.

I’ve successfully tested it without these lines of code and confirmed it works as expected for streaming / non streaming / reasoning_content / toolcall

@ServeurpersoCom
Collaborator

ServeurpersoCom commented Nov 3, 2025

I just realized this, and it seems strange: shouldn’t --reasoning-format none completely bypass any parsing logic instead of still going through it? It’s meant to be the raw passthrough mode for observing the model’s native output.

The .cpp files are already becoming huge and monolithic, making them harder to touch or refactor safely. The --reasoning-format options are also poorly named and not very explicit. In the long run, a modular templating system would help avoid piling up even more C++ parsing code.

If this work is meant to unify several next-generation parsers, maybe we could add a new keyword to --reasoning-format instead? It’s important to keep none as a truly no-parsing mode, since it’s essential for debugging new models.

Also, the current "auto" mode is actually just "deepseek" in practice, so it might be clearer to rename or document it that way to avoid confusion. Could your unified detection logic be implemented directly under auto (or deepseek, since they’re basically aliases)?

@semidark
Contributor

semidark commented Nov 15, 2025

I built and ran your fork/branch successfully on my Strix Halo AMD APU.

llama-server --host 0.0.0.0 --threads -1 -m $HOME/models/MiniMax-M2/MiniMax-M2-UD-Q3_K_XL-00001-of-00003.gguf -fa on --top-p 0.95 --top-k 40 --temp 1.0 --alias MiniMax-M2 --jinja --cache-type-k q4_0 --cache-type-v q4_0 -np 1 --ctx-size 204000 --reasoning-format none

The reasoning now correctly starts with the <think> tag, and the llama-server WebUI and OpenWebUI both interpret the LLM's output as expected.

I tried the web search tool in OpenWebUI. It works with gpt-oss-120b but not with MiniMax-M2.

I will later have a deeper look into this. I skimmed the comments in this PR and it said something about customizing the chat template. If anyone has a more specific hint, this of course would be welcome.

@CISC
Collaborator

CISC commented Nov 15, 2025

In this case, change _args.items() to _args | items in the template may work for you.

I doubt that will work, the problem isn't .items() but rather _args.

The llama.cpp maintainers suggested that I should not patch chat templates for known unsupported patterns during loading, so I have removed that logic. Users will need to modify the templates themselves if they rely on these patterns.

Don't livepatch, but it's ok to provide a patched template in models/templates.

@hksdpc255
Author

hksdpc255 commented Nov 15, 2025

@semidark As you can see from the maintainers’ comments (cc CISC), I cannot fix the official chat template with a live patch. Please make sure you’re using the chat template provided in this PR. You may need to add a parameter such as --chat-template-file xxx.jinja when launching llama-server. MiniMax-M2's built-in template is known to crash; the behavior of GLM's built-in template is unknown because it depends on which GGUF you're using.

@semidark
Contributor

semidark commented Nov 15, 2025

@hksdpc255

Thanks for the hint. I just built a container with the patch and did not even look at the files provided in this PR 😬. Sorry for that. I was so hyped to get this up and running that I fiddled it all together on my Android smartphone with Termux as the SSH client. 📳

I will try MiniMax-M2.jinja then.


UPDATE: WebSearch Tool call in OpenWebUI now works with the MiniMax-M2.jinja template. 🤩

@hksdpc255
Author

hksdpc255 commented Nov 15, 2025

@lainwir3d Could you share a bit more information about how you’re running llama-server? The chat context would also be helpful.

@hksdpc255
Author

@lainwir3d Try the new template, I'm not sure if it will fix your problem.

@lainwir3d

hey @hksdpc255 , here is how I start it. Will try with template fix now

llama-server \
  --host 10.42.0.100 \
  --port 10042 \
  --jinja \
  --threads -1 \
  --n-gpu-layers 99 \
  --ctx-size 131072 \
  --flash-attn on \
  --temp 0.4 \
  --min-p 0.05 \
  --top-p 0.95 \
  --top-k 40 \
  --prio 3 \
  --cache-type-v q4_0 \
  --cache-type-k q4_0 \
  --no-mmap \
  --repeat_penalty 1.1 \
  --batch-size 512 \
  --tensor-split 20,20,20,20 \
  --model ~/LLMs/glm_air_4.5/GLM-4.5-Air-Q4_K_M-00001-of-00002.gguf \
  --alias "unsloth/GLM-air-4.5" \
  --chat-template-file llama.cpp-templatefix/models/templates/GLM-4.6.jinja

@CISC
Collaborator

CISC commented Nov 15, 2025

@lainwir3d Try the new template, I'm not sure if it will fix your problem.

A simpler fix is just fixing the set, or wrapping it in if, like tc.function just above it, I suspect arguments is undefined:

{% set _args = tc.arguments or {} %}

@hksdpc255
Author

@lainwir3d Try the new template, I'm not sure if it will fix your problem.

A simpler fix is just fixing the set, or wrapping it in if, like tc.function just above it, I suspect arguments is undefined:

{% set _args = tc.arguments or {} %}

Thank you for your help. I will apply your fix instead.

@lainwir3d

Not better here:

[@continuedev] error: 500 Unknown method: items at row 75, column 22:
{% set _args = tc.arguments or {} %}
{% for k, v in _args.items() %}
                     ^
<arg_key>{{ k }}</arg_key>
 at row 75, column 1:
{% set _args = tc.arguments or {} %}
{% for k, v in _args.items() %}
^
<arg_key>{{ k }}</arg_key>
 at row 69, column 29:
{% if m.tool_calls %}
{% for tc in m.tool_calls %}
                            ^
{%- if tc.function %}
 at row 69, column 1:
{% if m.tool_calls %}
{% for tc in m.tool_calls %}
^
{%- if tc.function %}
 at row 68, column 22:
{%- endif -%}
{% if m.tool_calls %}
                     ^
{% for tc in m.tool_calls %}
 at row 68, column 1:
{%- endif -%}
{% if m.tool_calls %}
^
{% for tc in m.tool_calls %}
 at row 48, column 35:
{{- '/nothink' if (enable_thinking is defined and not enable_thinking and not visible_text(m.content).endswith("/nothink")) else '' -}}
{%- elif m.role == 'assistant' -%}
                                  ^
<|assistant|>
 at row 45, column 1:
{% for m in messages %}
{%- if m.role == 'user' -%}<|user|>
^
{{ visible_text(m.content) }}
 at row 44, column 24:
{%- endfor %}
{% for m in messages %}
                       ^
{%- if m.role == 'user' -%}<|user|>
 at row 44, column 1:
{%- endfor %}
{% for m in messages %}
^
{%- if m.role == 'user' -%}<|user|>
 at row 1, column 1:
[gMASK]<sop>
^
{%- if tools -%}
{"context":"llm_stream_chat","model":"unsloth/glm-4.5-air","provider":"openai","useOpenAIAdapter":true,"streamEnabled":true,"templateMessages":false}

Is that embedded / inline python? I can maybe try to fix myself

@CISC
Collaborator

CISC commented Nov 15, 2025

Not better here:

Strange, that would indicate that tc.arguments is something other than an object, in which case continue.dev is likely messing it up.

@CISC
Collaborator

CISC commented Nov 15, 2025

@lainwir3d if you can manage to output the whole messages that would be helpful, but it's starting to look like a bug in continue.dev.

Edit: Could of course also be that the model decides to output an unparsable tool call, which would end up a string.

Collaborator

@CISC CISC left a comment


Looking great, ready to merge once @pwilkin and @aldehir approve.

@aldehir
Collaborator

aldehir commented Nov 15, 2025

Full disclaimer: I am actively developing a different parsing approach (#17136) than what is currently in place. It overlaps with the work done here.

That said, we can probably merge this in. My PR needs further fleshing out and consensus. @pwilkin thoughts?

@CISC
Collaborator

CISC commented Nov 15, 2025

Full disclaimer: I am actively developing a different parsing approach (#17136) than what is currently in place. It overlaps with the work done here.

That said, we can probably merge this in until it gets fully fleshed out and merged (with buy-in). @pwilkin thoughts?

Very cool, and probably much needed. I'd imagine it takes a little effort to add the necessary logic for all the supported models though, so nice to have this added to the baseline first?

@aldehir
Collaborator

aldehir commented Nov 15, 2025

Very cool, and probably much needed. I'd imagine it takes a little effort to add the necessary logic for all the supported models though, so nice to have this added to the baseline first?

Agreed, there is clear desire from the community for these models.

@lainwir3d

lainwir3d commented Nov 15, 2025

Looks like a tool call is failing.

Could this be the issue? Sorry, trying to get my head around this I'm new to all this LLMs / tool calls / etc stuff

        {
          "role": "user",
          "content": "find the xxxx frontend main.cpp and rename function \"calibrate()\" to \"calib()\""
        },
        {
          "role": "assistant",
          "content": "I'll help you find the xxxx frontend main.cpp file and rename the \"calibrate()\" function to \"calib()\". Let me start by searching for the file.",
          "toolCalls": [
            {
              "id": "QyD49scltJgIRMvCbGHk5bbKjMyxRyZZ",
              "type": "function",
              "function": {
                "name": "file_glob_search",
                "arguments": "{\"pattern\":\"**/main.cpp\"}"
              }
            }
          ]
        },
        {
          "role": "tool",
          "content": "frontend/yyyy/src/main.cpp\nfrontend/xxxx/src/main.cpp\nfrontend/tools/zzzz/main.cpp\nfrontend/aaaa/src/main.cpp\n",
          "toolCallId": "QyD49scltJgIRMvCbGHk5bbKjMyxRyZZ"
        },
        {
          "role": "assistant",
          "content": " ",
          "toolCalls": [
            {
              "id": "wMgqfC5XnamTTNwp0iThbBK52gtRJdJG",
              "type": "function",
              "function": {
                "name": "read_file",
                "arguments": "{{\"filepath\":\"frontend/xxxx/src/main.cpp\"}"
              }
            }
          ]
        },
        {
          "role": "tool",
          "content": "read_file failed with the message: `filepath` argument is required and must not be empty or whitespace-only. (type string)\n\nPlease try something else or request further instructions.",
          "toolCallId": "wMgqfC5XnamTTNwp0iThbBK52gtRJdJG"
        }

arguments for read_file seem broken...
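A quick way to confirm this is to decode both `arguments` strings from the messages above. The Python sketch below uses a hypothetical `try_parse` helper; the two strings are copied from the log.

```python
import json

# Argument strings as they appear (after JSON unescaping) in the two tool calls.
good = "{\"pattern\":\"**/main.cpp\"}"
bad = "{{\"filepath\":\"frontend/xxxx/src/main.cpp\"}"   # note the doubled "{"

def try_parse(arguments):
    """Return (parsed_dict, None) on success or (None, error_message) on failure."""
    try:
        return json.loads(arguments), None
    except json.JSONDecodeError as err:
        return None, str(err)

good_args, good_err = try_parse(good)
bad_args, bad_err = try_parse(bad)
```

The first string decodes to an object with a `pattern` key, while the second is rejected because of the extra opening brace, so the tool never receives a `filepath` value.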

@CISC
Collaborator

CISC commented Nov 15, 2025

Looks like a tool call is failing.

Could this be the issue? Sorry, trying to get my head around this I'm new to all this LLMs / tool calls / etc stuff

arguments for read_file seem broken...

Yep, that would do it.

@pwilkin
Collaborator

pwilkin commented Nov 15, 2025

Yes, I imagine the comprehensive refactor @aldehir is working on will take some time, and in the meantime people really want a working version of tool calling for those models, so I'll say let's go for it, especially since @hksdpc255 put in a lot of work and accommodated all the suggestions.

hksdpc255 and others added 2 commits November 16, 2025 10:04
Co-authored-by: Sigbjørn Skjæret <[email protected]>
@hksdpc255
Author

hksdpc255 commented Nov 16, 2025

Not better here:

[@continuedev] error: 500 Unknown method: items at row 75, column 22:
{% set _args = tc.arguments or {} %}
{% for k, v in _args.items() %}
                     ^
<arg_key>{{ k }}</arg_key>
[...]

Is that embedded / inline python? I can maybe try to fix myself

Could you try this?
https://github.com/ggml-org/llama.cpp/blob/ea4f0ac2dac4441a6d860b9ae2b9d6d0dbdec4d7/models/templates/GLM-4.6.jinja

Or even further, use the fixed template here: #15904


Labels

testing Everything test related
